Abstract

Background All-atom crystallographic refinement of proteins is a laborious manu- 
ally driven procedure, as a result of which, alternative and multiconformer interpreta- 
tions are not routinely investigated. 

Results We describe efficient loop sampling procedures in Rappertk and demon- 
strate that single loops in proteins can be automatically and accurately modelled with 
few positional restraints. Loops constructed with a composite CNS/Rappertk protocol
consistently have better Rfree than those with CNS alone. This approach is extended 
to a more realistic scenario where there are often large positional uncertainties in loops 
along with small imperfections in the secondary structural framework. Both ensemble 
and collection methods are used to estimate the structural heterogeneity of loop regions. 

Conclusion Apart from benchmarking Rappertk for the all-atom protein refinement 
task, this work also demonstrates its utility in both aspects of loop modelling - building 
a single conformer and estimating structural heterogeneity the loops can exhibit. 



*This document is very similar to a chapter in SG's PhD thesis submitted in Sept. 2007 to the University
of Cambridge, England.



1 Introduction 



X-ray crystallography has been the most popular protein structure determination technique 
of both pre- and post-genomic eras. The challenges of macromolecular crystallography are 
manifold - after the difficult steps of expression, purification, crystallization and data col- 
lection, there remains the final and important task of data interpretation in order to build 
a model which explains the observed diffractions. Structural interpretation requires over- 
coming the phase problem and often starts with partial and incorrect phases. Typically, 
semi-automatic iterative refinement is carried out, gradually improving the model's quality
as indicated by the R and Rfree factors as well as a decrease in covalent geometry and excluded
volume violations. Although excellent software such as CCP4 (CCP4 (1994)), Phenix (Adams
et al. (2002)) and cns (Brunger et al. (1998)) makes this task possible, the structure refinement
procedure remains manually driven, hence laborious and subjective. Due to this, the
heterogeneity in structural interpretation of diffraction data is often ignored in favour of a 
single-conformer isotropic B-factor model. 

Protein structure is important for its function. But very stable, rigid proteins cannot ex- 
hibit enzymatic activity. This suggests that proteins have to be stable enough to retain their 
fold yet dynamic enough to be functional. Both experimental and computational studies 
indicate that single-conformer interpretation of crystallographic data is not adequate to capture
the native state dynamics, which is largely conserved even in a crystal owing to its high
solvent content (Petsko (1996), Jensen (1997)). Reporting a multiconformer interpretation of
data will make use of the structure less misleading, especially in the analyses that depend on 
geometry such as shapes of binding sites, orientations of sidechains, detection of non-covalent 
interactions and so on. While multiple interpretations are necessary, they should be free from 
any bias such as that introduced when different crystallographers solve the same diffraction 
data. Multiconformer interpretation will be greatly facilitated by automated methods. 

Thus multiple persuasive justifications emerge for automating the protein crystallographic
refinement task: (a) capturing the dynamics of the protein in the crystalline state, (b) removing
subjective bias from the refinement process and (c) reducing the need for precious human
resource. But this goal is hard to achieve in practice. The under-determined nature of the
problem (number of independent observations < number of parameters) prevents a straightforward
solution by minimization. Even when sufficient restraints exist, minimization methods
like conjugate gradient, steepest descent etc. suffer from the problem of local minima. Hence
use of well-known features of proteins is unavoidable. Automatic pattern recognition in electron
density is very successful in the presence of high resolution data and good phases because
it looks for such features (Perrakis et al. (1997)). But at medium resolution or given poor
phases, this strategy can get misled.

Our recent efforts with automated crystallographic refinement started with rapper, which
is a conformation sampling program for proteins and uses a genetic algorithm cum
branch-and-bound (gabb) algorithm. DePristo et al. (2004) showed that multiple interpretations
similar to the deposited structure are possible given the deposited data, and the divergence
in interpretation is correlated to resolution. With rapper, it was demonstrated (DePristo
et al. (2005)) that when a protein structure is approximately known, it can be refined
to native-like quality, unlike mdsa in cns which may get stuck in local minima. Fundamental
features of rapper responsible for avoidance of local minima traps were (a) use of
fine-grained, propensity-weighted φ-ψ maps for backbone sampling, (b) use of backbone-dependent
rotameric libraries, (c) use of ideal Engh and Huber covalent geometry and (d) mild
use of electron density and positional restraints to guide the sampling process. Later, Furnham
et al. (2006) demonstrated that a low-resolution dataset can be rescued and interpreted
semi-automatically to obtain the structure of a system with great biological significance.



DePristo et al. (2005) observed that automatic refinement becomes less satisfactory as
positional restraints become weaker: the structures could not be refined if the initial Cα
perturbation was of order of SA or more. This is not unexpected because larger positional 
restraints dilute the information and would make the search harder. But often a practical 
problem encountered in crystallography is that of missing loops, i.e. knowing loop regions 
with far less Ca positional certainty than the regions with regular secondary structure. By 
definition, loops exhibit rich variability in backbone torsion angles. They are thought to be 
more dynamic than the protein secondary structural framework and also functionally more 
interesting. Thus it is important to use the available restraints as efficiently as possible to 
build loops despite weaker electron density and greater positional uncertainty while tolerating 
small positional errors in the framework. 

After determining a single-conformer loop structure, the second important challenge is to 
estimate the structural variability of the loops. It is easy to see by generating artificial data 
that existence of structural heterogeneity for a loop results in confusing electron density. In 
general, partial occupancies result in weaker density than full occupancy. Sidechains of the 
same residue may occupy different density contours. Overlaps in conformations may lead to 
significant loss of shape information. These challenges can be expected to make the task only 
harder for minimization-based programs when refining a multiconformer structure. 

Following the reformulation of rapper as a versatile modular software called Rappertk
(Gore et al. (2007)), it was essential to benchmark its performance for all-atom protein
crystallographic refinement. Hence the first result in this work is the all-atom knowledge-based
crystallographic refinement given the positional restraints for the entire protein, establishing
that results similar to those of rapper (DePristo et al. (2005)) can be achieved. We then
demonstrate that single loops in proteins can be reconstructed to a high quality with Rappertk
using little positional information. This case is then extended to include all loop regions and
a small error in the framework to show that the composite CNS/Rappertk refinement approach
is suitable in a realistic scenario. Finally, we ask whether single-loop heterogeneity can be
modelled with collections of independently generated models or ensembles of conformers.



2 Methods 

2.1 Overview of iterative refinement 

Each step in refinement procedure consists of: (a) identification of residues which do not fit 
well into density, (b) finding contiguous bands of such residues, (c) rebuilding the bands with 
knowledge-based conformational sampling within restraints, (d) optimal sidechain placement 
of rebuilt sidechains and (e) refining the resulting model with cns. 

The fit of a set of atoms to electron density is calculated as the correlation coefficient
between the σA-weighted omit map and the Fc map for a region within 1 Å of the atoms. The
maps used are both generated by the cns refinement script, hence they are described on the
same grid. Following Kleywegt and Jones (1996) and DePristo et al. (2005), the correlation
coefficient between the maps is calculated on the grid neighbourhood around atoms of interest:

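$$\mathrm{CorrCoef} = \frac{\langle \rho_{omit}\,\rho_{c} \rangle - \langle \rho_{omit} \rangle \langle \rho_{c} \rangle}{\sigma_{omit}\,\sigma_{c}} \qquad (1)$$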



If the correlation is below 0.9, the atoms are flagged for rebuilding. Correlation is
calculated on all atoms in residues and then on mainchains only, sidechains only and peptide
atoms only. Ill-fitting sidechains are marked for sidechain reassignment whereas residues with
ill-fitting peptide, mainchain or all-atom sets are marked for all-atom reconstruction.
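
The flagging rule above can be sketched in a few lines of Python. This is a minimal illustration and not the Rappertk implementation: it assumes the omit-map and Fc-map values at the grid points near the atoms have already been extracted into two equal-length arrays, and the function names `map_correlation` and `needs_rebuild` are hypothetical.

```python
import numpy as np

def map_correlation(omit_values, calc_values):
    """Pearson correlation between two sets of map values sampled on the same grid points."""
    omit = np.asarray(omit_values, dtype=float).ravel()
    calc = np.asarray(calc_values, dtype=float).ravel()
    d_omit = omit - omit.mean()
    d_calc = calc - calc.mean()
    denom = np.sqrt((d_omit ** 2).sum() * (d_calc ** 2).sum())
    if denom == 0.0:
        # Flat density carries no shape information; treat it as uncorrelated (assumption).
        return 0.0
    return float((d_omit * d_calc).sum() / denom)

def needs_rebuild(omit_values, calc_values, threshold=0.9):
    """Flag a set of atoms when the map correlation falls below the threshold (0.9 in the text)."""
    return map_correlation(omit_values, calc_values) < threshold
```

The same function would be called separately for all atoms, mainchain atoms, sidechain atoms and peptide atoms of a residue, since the text applies the threshold to each of those subsets.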

Once the residues are flagged for all-atom rebuilding, contiguous bands are identified and
marked either as N-terminal, C-terminal or intermediate. Bands are then sampled in random
order using the PopulationSearch algorithm. Each band is attempted 5 times and left as it
was if it cannot be sampled within the given restraints. Previously sampled bands are considered
while sampling later bands. C- and N-terminal bands are built using forward and reverse
techniques respectively, with weighted sampling of φ-ψ propensities. Building of intermediate
regions is described in a later section.
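
As an illustration of the band identification step, the following Python sketch groups flagged residue indices into contiguous bands and labels each one. It assumes residues are indexed 0 to n-1 along a single chain; the function name `find_bands` and the tuple layout are hypothetical and chosen only for this example.

```python
def find_bands(flagged, n_residues):
    """Group flagged residue indices into contiguous bands and label each band."""
    flagged = set(flagged)
    bands = []
    start = None
    for idx in range(n_residues):
        if idx in flagged:
            if start is None:
                start = idx  # a new band opens here
        elif start is not None:
            bands.append((start, idx - 1))  # the band closed at the previous residue
            start = None
    if start is not None:
        bands.append((start, n_residues - 1))

    labelled = []
    for first, last in bands:
        if first == 0:
            kind = "N-terminal"
        elif last == n_residues - 1:
            kind = "C-terminal"
        else:
            kind = "intermediate"
        labelled.append((first, last, kind))
    return labelled
```

For a 10-residue chain with residues 0, 1, 4, 5, 6 and 9 flagged, this yields one N-terminal band (0 to 1), one intermediate band (4 to 6) and one C-terminal band (9 to 9). Shuffling the returned list would give the random sampling order mentioned in the text.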

Once all bands are sampled, all the resampled sidechains are reassigned using the optimal
sidechain placement procedure described elsewhere (Gore and Blundell (2007)).






2.2 Electron Density Ranker 

Generally a single-conformer model would refine better if its occupation of the 2Fo − Fc map
is better, so a model within the 1σ contour is more reliable than one within 0.5σ. Thus, on the
output of each builder, a binary electron density restraint can be applied with a σ-level cutoff. But
the quality of the map is not uniform over all residues and hence such a binary restraint is useful
only for ensuring that the model remains within positive electron density. Hence, in addition to
that restraint, an analog ranker is used for electron density that ranks the possibilities and
chooses the better ones. In the population search algorithm, the ranker asks for more children to
be generated at each conformation extension step than the population size (typically 5 times
more), ranks them and chooses the top-ranking ones to fill the conformation pool. The ratio of the
number of children generated to the population size is termed the enrichment ratio. The
electron density ranking scheme calculates the score of a set of atoms by summing up the σ values
in a 1 Å region around their coordinates. The effective σ value is calculated by penalising
negative σ and flattening the peaks with an upper cutoff, the latter for better recognition of the
shape of density rather than spikes, say due to waters or ions. In addition to filtering children at
each step, the ranker also chooses the best member from the conformational pool generated,
which is returned as the sampled model for the band.
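
The ranking scheme can be sketched as follows. This is a hedged illustration only: the text does not give the upper cutoff or the size of the negative-density penalty, so `upper_cutoff=2.0` and `negative_penalty=2.0` are placeholder values, and the function names are hypothetical. The sketch assumes the σ values at grid points within 1 Å of the atoms have already been collected.

```python
import numpy as np

def effective_sigma(sigma_values, upper_cutoff=2.0, negative_penalty=2.0):
    """Flatten peaks above the cutoff and amplify negative density (placeholder parameters)."""
    sigma = np.asarray(sigma_values, dtype=float)
    capped = np.minimum(sigma, upper_cutoff)
    return np.where(capped < 0.0, negative_penalty * capped, capped)

def density_score(sigma_values, upper_cutoff=2.0, negative_penalty=2.0):
    """Score a set of atoms by summing effective sigma values near their coordinates."""
    return float(effective_sigma(sigma_values, upper_cutoff, negative_penalty).sum())

def select_children(child_scores, population_size):
    """Return indices of the top-scoring children, best first, to fill the conformation pool."""
    order = np.argsort(child_scores)[::-1]
    return order[:population_size].tolist()
```

With an enrichment ratio of 5 and a population size of 100, `select_children` would be handed 500 scores and would keep the best 100.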



2.3 Symmetry-related clash checking

As described in Gore et al. (2007), Rappertk uses geometric caching implemented as a
Clashchecking grid for efficiently deciding whether atomic van der Waals spheres are overlapping. This
excluded-volume restraint rules out many unproductive sampling trajectories. When loop
positions are largely uncertain, the existence of symmetry-related images of the molecule
around it acts as an excluded volume restraint to loop sampling. Rappertk uses the Clipper
(Cowtan (2003)) libraries for crystallographic computing for symmetry-related calculation.
The Clashchecking grid uses Clipper's Spacegroup class and the symmetry operators therein to
calculate the images of atoms to be added into the grid. Images within 20 Å of the bounding
box of the given protein coordinates are considered. First the grid looks for any clashes between
sampled coordinates and their images. Then it is verified that they do not clash against the
rest of the coordinates or their images. In case of no clashes, the new coordinates and their
images are added to the grid. Removing coordinates from the grid removes their images too.
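
The idea of treating symmetry images as excluded volume can be illustrated with a simplified NumPy sketch. It does not use Clipper or a caching grid: symmetry operators are assumed to be supplied already in Cartesian form as (rotation matrix, translation vector) pairs, lattice translations are assumed to be folded into that list, and a single distance cutoff `min_dist` stands in for per-atom van der Waals radii. All names here are hypothetical.

```python
import numpy as np

def symmetry_images(coords, cartesian_ops, margin=20.0):
    """Apply each (rotation, translation) operator and keep image atoms that fall
    within `margin` of the bounding box of the original coordinates."""
    coords = np.asarray(coords, dtype=float)
    lo = coords.min(axis=0) - margin
    hi = coords.max(axis=0) + margin
    kept = []
    for rot, trans in cartesian_ops:  # the identity operator should not be passed in
        image = coords @ np.asarray(rot, dtype=float).T + np.asarray(trans, dtype=float)
        inside = np.all((image >= lo) & (image <= hi), axis=1)
        kept.append(image[inside])
    return np.vstack(kept) if kept else np.empty((0, 3))

def has_clash(new_coords, existing_coords, min_dist=3.0):
    """True if any new atom lies closer than min_dist to any existing atom (brute force)."""
    new_coords = np.asarray(new_coords, dtype=float).reshape(-1, 3)
    existing_coords = np.asarray(existing_coords, dtype=float).reshape(-1, 3)
    if len(new_coords) == 0 or len(existing_coords) == 0:
        return False
    diff = new_coords[:, None, :] - existing_coords[None, :, :]
    return bool((np.linalg.norm(diff, axis=2) < min_dist).any())
```

A grid-based cache, as used in Rappertk, avoids the all-pairs distance computation shown here; the brute-force check is kept only for clarity.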



2.4 Loop closure 

The typical incremental sampling step in Rappertk extends the mainchain by one step and builds
the (i − 1)th sidechain in the forward mode, or the (i + 1)th sidechain in the reverse mode. In this
context, loop closure can be formulated as finding the locations of mainchain atoms
$\{C^{i-1}, O^{i-1}, N^{i}, C_\alpha^{i}, C^{i}, O^{i}, N^{i+1}\}$ and sidechains of the
(i − 1)th, ith and (i + 1)th residues. Seamless loop closure of this kind
is challenging because many conditions need to be met: (a) the covalent angles and lengths
should be correct, (b) the φ, ψ states of 3 residues should be in the allowed regions, (c) two ω
angles should adopt cis or trans conformation, but not be restricted to one or the other,
(d) 3 sidechains should be rotameric and (e) van der Waals restraints should be obeyed.

A sampling procedure was devised for meeting conditions (a), (b), (c) and (d), while (e) is
met using clash-checking restraints. The sampling procedure is similar to the one described
in Gore et al. (2007), but modified to meet the cis ω state too. First, the two ω angles
are sampled, leading to the center, plane and radius of the circle on which the middle Cα is
sampled. The circle is uniformly sampled. For each sample, the {r, α, θ} to {φ, ψ, ω} mapping
is used to build the mainchain atoms. Then the three sidechains are sampled from a rotamer
library. Sampling is continued until a conformation satisfying all restraints is found.

The problem with this sampling is that the restraint density abruptly increases at loop
closure because it is not clear how to back-propagate the geometric requirements (a) and
(c). Due to this, the sampling procedure fails often and takes a long time to find a valid
sample. Often an incorrect conformation is built in case of imperfect density because sampling
of φ, ψ, ω is not propensity-weighted. Hence after significant experimentation with this
approach, it was abandoned in favour of a simpler approximate approach.

In the simpler approach, the loop closure is formulated as finding the coordinates of
mainchain atoms $\{C^{i}, O^{i}, N^{i+1}\}$ and sidechains of residues i and i + 1. A φ sampler is used
to build the $C^{i}$ atom, which is required to lie between 0.5 and 2 Å from the $N^{i+1}$ atom. The covalent
angles at the closure, including $C^{i} - N^{i+1} - C_\alpha^{i+1}$, are restrained to lie between 90° and 150°. The ω
dihedral angle $C_\alpha^{i} - C^{i} - N^{i+1} - C_\alpha^{i+1}$ is allowed a maximum deviation of 30° from the cis or
trans conformation. Two sidechains are sampled for each sampled $C^{i}$ position. It is observed that it
is more efficient to close a loop with this method than the previous.
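
The acceptance test of the simpler closure can be written down compactly. The sketch below is an illustration under stated assumptions, not the Rappertk code: the helper names are hypothetical, the caller is assumed to supply the C to N distance, the junction angles and ω (in the range -180° to 180°), and the numerical windows are the ones quoted in the text.

```python
import numpy as np

def angle_deg(a, b, c):
    """Angle a-b-c in degrees."""
    v1 = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    v2 = np.asarray(c, dtype=float) - np.asarray(b, dtype=float)
    cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral p0-p1-p2-p3 in degrees (0 for cis, +/-180 for trans)."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1  # component of b0 perpendicular to the central bond
    w = b2 - np.dot(b2, b1) * b1  # component of b2 perpendicular to the central bond
    return float(np.degrees(np.arctan2(np.dot(np.cross(b1, v), w), np.dot(v, w))))

def closure_acceptable(c_n_distance, junction_angles_deg, omega_deg):
    """Apply the distance, angle and omega windows of the approximate closure."""
    if not 0.5 <= c_n_distance <= 2.0:
        return False
    if any(not 90.0 <= a <= 150.0 for a in junction_angles_deg):
        return False
    deviation_from_cis = abs(omega_deg)
    deviation_from_trans = 180.0 - abs(omega_deg)
    return min(deviation_from_cis, deviation_from_trans) <= 30.0
```

In a sampling loop, each candidate position of the closing carbonyl carbon would be passed through `closure_acceptable` before sidechains are sampled, so that geometrically hopeless candidates are discarded early.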

2.5 Both-sided sampling 

Bands to be rebuilt can be of three types: N-terminal, C-terminal or intermediate.
For the C- and N-terminal bands, forward and reverse sampling are used respectively.
For the intermediate regions, the most efficient way is to use a both-sided sampling approach
as opposed to only forward sampling. As explained previously (see Gore et al. (2007) in
the context of β-sheet sampling), in both-sided sampling, residues are sampled in the order
i, k, i + 1, k − 1, i + 2, k − 2, .... In the case of forward or reverse sampling, only a weak loop-closure
distance restraint informs the sampling process of the other end of the loop, but with both-sided
sampling, information at both N and C termini is actively used. A distance restraint is
used between Cα atoms at the same sequence distance from both termini, so that the chance
of loop closure remains high despite both-sided sampling. Initial experiments with crystallographic
loop building clearly showed that refinement was better with both-sided sampling
than forward-only sampling, especially with larger loops and weaker positional restraints.
Thus, in this work, we have used forward-only, backward-only and both-sided sampling for
C-terminal, N-terminal and intermediate bands respectively.
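
The both-sided sampling order i, k, i + 1, k − 1, ... can be generated with a short helper. This is a minimal sketch; the function name `both_sided_order` is hypothetical, and i and k are taken to be the first and last residue indices of an intermediate band.

```python
def both_sided_order(i, k):
    """Return residue indices alternating from the N-side (i) and the C-side (k) of a band."""
    order = []
    lo, hi = i, k
    while lo <= hi:
        order.append(lo)
        if hi != lo:
            order.append(hi)  # skip the duplicate when the two fronts meet on one residue
        lo += 1
        hi -= 1
    return order
```

For a band spanning residues 3 to 8 this gives 3, 8, 4, 7, 5, 6, so the two growing ends meet in the middle of the band, where loop closure is then attempted.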

2.6 Multiconformer sampling 

This type of sampling constructs many conformations of the same band simultaneously. 
Instead of incrementally sampling one band, multiple models of the band are extended si- 
multaneously. This is achieved by re-implementing the PopulationSearch algorithm in its 
plural form in which each builder is replaced by a set of builders that have same input and 
output atoms in different models. Clashchecking is not performed across models. Electron 
density ranker uses the combined output of a set of corresponding builders to calculate the 
score of a child conformation. Due to this, the possibility of getting attracted into higher 
density is reduced and the chance of occupying the density generated due to genuine het- 
erogeneity increases. The disadvantage of this kind of sampling is the obvious increase in 
conformational freedom and execution time. 



3 Results 



3.1 Reproducing rapper/cns refinement 



The utility of knowledge-based refinement has been demonstrated by DePristo et al. (2005)
with automatic refinement of perturbed starting structures of 9ILB and 1KX8 to an Rfree almost
the same as that of the deposited structure. When that refinement protocol was closely reproduced
in Rappertk, very similar results were obtained. Five proteins were selected in the 2 Å to 3 Å
resolution range from the pdb: 9ILB (Yu et al. (1999)), 1KX8 (Lartigue et al. (2002)), 1MB1
(Taylor et al. (1997)), 1BYW (Cabral et al. (1998)) and 1RN7 (Alvarez-Fernandez et al.
(2005)). Five perturbed structures were generated for each of them within 2 Å Cα and 3 Å
sidechain centroid restraints. 20 rounds of both CNS-only and CNS/Rappertk
refinement protocols were performed on these starting models to obtain the Rfree statistics
summarized in Table 1. Rfree figures reported for 9ILB and 1KX8 by DePristo et al. (2005)
for the same restraints were 0.27 (0.01) and 0.32 (0.01) respectively; the corresponding statistics
observed here are comparable.
Table 1: Full-protein test set and refinement statistics for 5 starting models generated within
2 Å Cα and 3 Å sidechain restraints.
Table 1 shows the variation in all-atom rmsd and χ1, χ1,2 values as a function of resolution.
The reported rmsd is the average of pairwise unsuperposed rmsd between the deposited and
each of the composite models and thus can be said to indicate the inaccuracy in retrieving
the deposited model from an approximate starting model. This inaccuracy does not seem to
be sensitive to the resolution, suggesting that at least in the 2.1 Å to 2.8 Å resolution range, an
approximate model can be corrected to a similar quality with respect to the deposited one
irrespective of the resolution.

When the models are compared among themselves in a pairwise manner, the average
rmsd and χ1, χ1,2 figures can be said to represent heterogeneity. This heterogeneity is slightly
lower than the inaccuracy, but the difference is insignificant, i.e. each model is as far from the
deposited structure as from any other model. Recently, heterogeneity defined similarly
has been suggested to be the minimum uncertainty expected in the coordinates of a
single-conformer model of that structure (Terwilliger et al. (2007)).

An ideal refinement method should start with approximate models and yield a set of high
heterogeneity models each of which agrees at least as well with the data as the deposited
structure, i.e. a combination of the results in DePristo et al. (2004) and DePristo et al. (2005).
Clearly, the protocol used is similar to DePristo et al. (2005) and, perhaps expectedly, does
not yield greater heterogeneity than inaccuracy. But when the models are assigned partial
occupancies and combined to create a multiconformer model (Fig. 4, Section 3.4), the collection
Rfree values for 1MB1, 1BYW and 1KX8 drop significantly, to 1.5-2% below that of the
deposited structure. This drop suggests that perhaps structural heterogeneity is captured to
some extent.



3.2 Rebuilding missing loops 

Five structures of various resolutions and no obvious homology were selected from the pdb:
1MB1 (Taylor et al. (1997)), 1BYW (Cabral et al. (1998)), 1KXB (Choi et al. (1996)), 1RN7
(Alvarez-Fernandez et al. (2005)) and 2DB0 (Ishii et al. (2005)). All structures have a single
continuous peptide chain of between 100 and 200 residues and no ligands. For each structure, a loop
at least 10 residues long was chosen for rebuilding (Table 2). Unlike the previous exercise,
there are no positional restraints on loop sidechains. Cα atoms are positionally restrained, in
the first case with 5 Å restraints and later with 10 Å restraints. The loops were rebuilt using
the both-sided loop sampling and loop closure techniques within the iterative refinement
protocol. For 4 of 5 loops considered, the 5 Å perturbation of the loop can be corrected to
within one point of the baseline Rfree. In the 10 Å case, this performance drops marginally
to two points from the baseline Rfree for the same cases.

Fig. 1 shows the large difference in the quality of the CNS-only and CNS/Rappertk refinement
protocols. Every starting model refines to a structure very similar to the deposited one using the
composite protocol whereas it gets trapped in local minima during CNS-only refinement. The
1RN7 case (Fig. 2) is unsatisfactory due to a difficult 5-residue segment in the loop (Pro-80,
Asn-81, Leu-82, Asp-83, Asn-84). As noted by Alvarez-Fernandez et al. (2005), the density for
this segment is confusing, perhaps due to underlying heterogeneity, and consistently misleads
the band sampling into a conformation different from the deposited one.



3.3 Framework and loops 

Perhaps a more frequently encountered scenario than the previous two is the one in which 
both the secondary structure and loops have positional uncertainty. In such cases, the loop 
regions invariably are more unreliable than the secondary structure framework. In order to 
simulate this scenario, the framework was restrained to lA Ca and SA sidechain centroid 
restraints, whereas loops were restrained to SA Ca restraints and no sidechain restraints. 
Five models were built for each protein and then iteratively refined using both CNS-only and
CNS/Rappertk protocols. The refinement statistics are summarized in Table 3. Note that
the composite refinement statistics are not as good as in the previous exercises,
but are still better than those of the CNS-only refinement. Fig. 3 shows a typical contrast between the
CNS-only and the composite refinement protocols in this scenario.



3.4 Variation of Rfree with collection size 

We define a collection as a set of independently refined structures which when taken together 
may capture some aspects of structural heterogeneity. This term is introduced to distinguish 
the collection from an ensemble (see Section 3.6) which is also a set of structures, but refined 
in an interdependent manner. 

For the previous exercises of refinement (whole chain, 5 Å loop, 10 Å loop and loops with
framework), the single best Rfree models from five CNS/Rappertk trajectories are chosen and
combined to create collections, e.g. a three-model collection is created by choosing the three
lowest Rfree models from the five and assigning an occupancy of 0.33 to each of them. A
collection is subjected to a short cns refinement and the Rfree at its end is noted as the collection
Rfree. Fig. 4 shows the variation of such Rfree values as a function of collection size. The
drop in Rfree is modest and is highest when going from a collection size of 1 to 2 or 3. The
collection Rfree generally rises for sizes 4 and 5. This indicates the danger of overfitting due
to the increase in the number of parameters. Thus a straightforward combination of models does
not seem to be the correct way of describing heterogeneity. Intelligent schemes for parameter
reduction need to be investigated in this regard, such as upper-bounding B-factors, enforcing
Table 2: Dataset for loop building and refinement statistics.
Figure 1: Loop building exercise for the 1MB1 loop with 10 Å Cα restraints. Top panel shows
the loop in the deposited structure (green) and starting models generated for it. Middle
panel shows the best Rfree models (slate) obtained during the CNS-only refinement. Bottom
panel shows the CNS/Rappertk models (magenta) and the loop in the deposited structure
(green) in all-atom representation.
Figure 2: Loop building exercise for the 1RN7 loop with 5 Å restraints. Panels are arranged
in a similar way to Fig. 1.




Figure 3: Framework/loop exercise for 1BYW. The deposited structure is shown as thick
green ribbon and starting structures as brown ribbon in the top panel. Middle panel shows
the best Rfree structures from CNS-only trajectories (slate) and bottom panel shows those
from the CNS/Rappertk trajectories (magenta).
Table 3: Dataset and refinement statistics for the loop-framework refinement

PDB    Resolution (Å)   #AA   Loops #AA (%)   Baseline Rfree   CNS-only Rfree       CNS/Rappertk Rfree
                                                               Mean (Std.Dev.)      Mean (Std.Dev.)
1MB1   2.1              98    27 (28)         0.292            0.352 (0.012)        0.325 (0.010)
1BYW   2.6              110   50 (45)         0.292            0.400 (0.046)        0.321 (0.016)
1KXB   2.9              158   54 (36)         0.283            0.351 (0.022)        0.336 (0.006)
1RN7   2.5              112   33 (29)         0.270            0.342 (0.032)        0.328 (0.008)
2DB0   2.76             148   57 (39)         0.289            0.370 (0.020)        0.335 (0.017)

the same B-factor on corresponding atoms across all models and positionally constraining 
them together when large variability is not expected. 
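
The construction of a collection described in this section (pick the lowest-Rfree models and give them equal occupancies) can be sketched as below. The function name, the dictionary layout and the example Rfree values are hypothetical and serve only as illustration; the subsequent short cns refinement of the combined model is not shown.

```python
def build_collection(rfree_by_model, size):
    """Choose the `size` lowest-Rfree models and assign each an equal partial occupancy."""
    chosen = sorted(rfree_by_model, key=rfree_by_model.get)[:size]
    occupancy = 1.0 / len(chosen)
    return [(name, occupancy) for name in chosen]

# Hypothetical best-Rfree values from five independent trajectories.
example = {"traj1": 0.31, "traj2": 0.29, "traj3": 0.33, "traj4": 0.30, "traj5": 0.35}
three_model_collection = build_collection(example, 3)
```

Each added member brings a full set of coordinates and B-factors, which is why the parameter-reduction schemes listed above matter once the collection grows beyond two or three models.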

3.5 Mistakes in the composite protocol 

Although the CNS/Rappertk refinement produces a well-refined structure very close to the
baseline Rfree, it never betters the latter. This is due to imperfections in various components 
of the protocol. Identification of residues to rebuild relies on the correlation coefficient 
which can sometimes be an unsatisfactory substitute for human judgement. This can lead 
to unnecessary resampling of satisfactory bands and sometimes incorrect conformers are not 
detected. The copying of non-protein atoms from one round to next may sometimes result 
in their permanent misplacement. The sampling problem may result in unsatisfactory bands 
because the right conformation must be generated in order to be picked by the electron 
density ranker. On the other hand, the ranker may not score a correct conformation as the 
best one. If the density for a band is weak, band sampling may sometimes lead into the
density of waters or ligands. In spite of these difficulties, the refinement statistics presented 
in previous sections are satisfactory. 

But when restraint radii are increased beyond those used here, the serious problem of 
spatial overlap between restraint spheres of two different bands starts affecting the band 
sampling. The correct density for a band then may be occupied by a band sampled before it. 
In such case, the correct density is always occupied by the wrong band. For the wrong band, 
subsequent cns refinement may take it so far from its correct location that the restraint radii 
may be too small to let the band be built correctly again. Further work is required to get rid 
of the problem of band overlaps, either by restraint adjustments or change in the sampling
strategy. 






Figure 4: Variation of Rfree with collection size in whole chain, loop and framework-loop 
exercises. For collection size of 0, baseline Rfree is shown. For collection size 1, the mean 
and standard deviation of best Rfree models in CNS/Rappertk trajectories are shown. The
rest are calculated by combining the best Rfree individual models with partial occupancies. 
3.6 Modelling loop heterogeneity with collections and ensembles

Conformational diversity is most pronounced in loop regions due to the relatively smaller 
number of non-covalent interactions to maintain order. Absence of good density for a loop 
when rest of the structure has good density is a sure indication of the loop's flexibility. 
Modelling heterogeneity is challenging because density is generally more confusing for such 
regions owing to conformer overlaps and subsequent dilution of shape information. 

Heterogeneity can be modelled with collections or ensembles. Members of a collection 
are single-conformer isotropic B-factor models determined independently of one another. 
Members of an ensemble are determined in a highly interdependent manner and have partial 
occupancies. 

Derivation of a collection is a simple way to estimate the unavoidable uncertainty in 
structure determination, but it cannot be said to represent any structural correlations. A 
major advantage of collections is their simplicity. A procedure that produces a single model 
can be executed multiple times with different random seeds or starting models to generate a 
collection. Thus the time taken increases only linearly as the collection size. 

An implicit assumption in the ensemble representation is that the members are in dynamic 
equilibrium, making ensemble a much stronger statement than the collection. Determining 
ensembles is very challenging because it is unclear how to determine the number and occupancies
of the ensemble members prior to or during the refinement process. Another major
challenge is the linear increase in the number of parameters which results in an exponential 
increase in search space and execution time. 

In order that its output be credible, any procedure that aims to model the structural 
heterogeneity must be first validated using artificial data where the real heterogeneity is
accurately known. To that end, we have chosen a significantly simplified kind of heterogeneity 
by generating artificial diffraction data in which the underlying heterogeneity is restricted to 
a single loop and consists of two equally-occupied loop conformations. For a loop each from 
1MB1, 1BYW and 2DB0, two conformers were generated by perturbing the loop to within
SA Ca restraints and no sidechain restraints. All non-protein atoms (ions, waters etc.) have 
been removed so as to reduce the density dilution. Artificial diffraction data were created 
with the same cell, spacegroup and resolution as the deposited structure. For self-consistency 
in the cns forcefield, data was generated iteratively. The average of the two conformations 
was considered as the starting conformation for further heterogeneity modelling. 

A collection of 4 members was generated for each loop with the CNS/Rappertk protocol used
previously. An ensemble consisting of 2 members was generated for each loop using the
multiconformer sampling described previously. As with the single-conformer protocol,
multiconformers were sampled iteratively, alternating with cns. Enrichment was increased to 10
and population size to 200 to be able to build reasonable models. A positive electron density
restraint was enforced on the mainchain.

The performance of collections and ensembles can be visually inspected in Fig. 5, Fig. 6 
and Fig. 7, but it is essential to quantify the quality of heterogeneity modelling objectively. 
The two important quantities to measure are the extent to which both conformers are captured 
and the extent to which at least one conformer is captured. The former (multiconformer 
quality index, MQI) should quantify how much of the underlying diversity is represented and 
the latter (single-conformer quality index, SQI) should quantify how well at least one of the 
heterogeneous states is modelled. 

If conformers $H_i$ constitute the true underlying heterogeneity and $M_j$ are the ensemble 
or collection members which model it, then MQI and SQI can be calculated as: 

$$\mathrm{MQI} = \sum_i \min_j \, \mathrm{rmsd}(H_i, M_j) \qquad (2)$$

$$\mathrm{SQI} = \min_{i,j} \, \mathrm{rmsd}(H_i, M_j)$$

where rmsd is calculated over the atoms of interest, e.g. a sidechain or all Cα atoms in 
the loop. Note that these expressions do not consider the occupancies. Both indices are 
rmsd-based, so lower values indicate better modelling. Table 4 quantifies 
the performance of ensembles and collections using Rfree, MQI over loop Cα atoms, and 
MQI and SQI over each sidechain. The ensemble Rfree values are smaller than those for 
collections, as expected due to the greater number of parameters. Cα MQI suggests that mainchain 
heterogeneity is modelled better by the ensemble method. Sidechain SQIs do not show any 
systematic difference between the two methods, which suggests that both methods capture 
the rotameric heterogeneity to a similar extent. But sidechain MQIs tend to be slightly better 
for the ensemble than for the collection. This is not surprising because, in principle, the only 
limitations on the ensemble method are sampling and ranking of conformational extensions. 
Generally the higher density option is chosen in single-conformer modelling, but a lower density 
can also be chosen in multiconformer modelling due to the greater number of atoms to place in 
the density. This is evident from residues Lys-72 and Tyr-73 in 1MB1, Glu-118 and Asp-119 in 
1BYW, and Lys-86 and Lys-87 in 2DB0, where collection models are biased towards one of the 
loop conformations due to weak density. 
Figure 5: Heterogeneity modelling with collection and ensemble for a loop in PDB 1MB1. 
The top panel shows the artificially generated loop heterogeneity with corresponding electron 
density contoured at 1σ. The middle and bottom panels respectively show a 4-member 
collection and 2-member ensemble model of that heterogeneity. 

Figure 6: Heterogeneity modelling with collection and ensemble for a loop in PDB 1BYW. 
Panels arranged as in Fig. 5. 

Figure 7: Heterogeneity modelling with collection and ensemble for a loop in PDB 2DB0. 
Panels arranged as in Fig. 5. In the top panel, the red contours correspond to 0.5σ. 

Table 4: Comparison of collection and ensemble heterogeneity modelling styles with 
artificially generated 2-conformer heterogeneity for 3 loops. MQI and SQI are in Å units. 

4 Concluding Discussion 

The problem of automated crystallographic refinement is interesting, challenging and has 
significant immediate practical relevance. An automated solution for building 
single-conformer models from approximate spatial restraints, combined with an automated method 
to explore heterogeneity, will allow the crystallographic community to revisit and annotate 
the entire PDB with structural variability information. Such information will have an 
impact on all analyses that rely on accurate coordinate information, such as in-silico ligand 
design and binding, non-covalent interactions and sequence-structure conservation. It will also 
significantly change the understanding of the crystalline state and protein flexibility, 
benefitting the refinement process. The main components for successful heterogeneity annotation 
are reliable construction of single-conformer models and reliable estimation of heterogeneity, 
which is predominantly seen in loops. This work has attempted to develop methods to that end. 



Application of the rapper approach to crystallographic refinement (DePristo et al. (2005), 
Furnham et al. (2006)) has the primary benefit of crossing the energy barriers in a non-random 
manner, based on knowledge-based sampling instead of kinetic sampling. As described in 
Gore et al. (2007), rapper has recently been reformulated as Rappertk, creating possibilities 
of applying knowledge-based sampling in many different ways. In this work, we showed 
that Rappertk can be used to automatically refine the whole protein structure starting from 
reliable positional restraints on mainchain and sidechain. This benchmarked its performance 
vis-a-vis rapper as reported by DePristo et al. (2005) for a similar task. Then we showed 
that by efficient use of available restraints, single loops can be modelled in protein structures 
to native-like quality with few positional restraints. The efficient use of available restraints 
was a result of symmetry-related clash checking, restraint propagation using loop anchors and 
sampling from both anchors simultaneously. The same strategy could be extended to the more 
realistic problem of a large uncertainty in loop regions and an imperfect secondary structure 
framework. The CNS-only refinements, run as controls, showed the value added by Rappertk 
to the refinement task. 

In addition to determination of single-conformer models under differing restraint qualities, 
we started addressing the challenge of heterogeneity assessment in loops. This is indeed a very 
difficult problem with fundamental unknowns such as the number of conformers, correlations 
within the heterogeneity and relative occupancies. Conformational heterogeneity can be divided 
into two types: the simpler sidechain-only heterogeneity, where the mainchain is nearly the same, 
and the all-atom heterogeneity, where the mainchain also takes distinct conformations. The latter 
can be further divided based on the extent of spatial overlap between the conformers. 
Sidechain-only heterogeneity is easier to model than all-atom heterogeneity because the density 
is likely to contain good cues about diversity. But for overlapping conformers, a visual 
inspection of density is less likely to be helpful. There are two distinct ways to model 
heterogeneity, which we have termed collections and ensembles, depending upon the 
interdependence of member conformations. For single-loop 2-conformer overlapping heterogeneity, 
we generated both collections and ensembles and assessed how well they modelled the 
heterogeneity. The main observation was that the collection was generally biased towards the 
higher electron density. The ensemble method, due to the greater freedom and number of parameters 
available to it, manages to avoid this trap and fit two distinct conformers, leading to better 
modelling of heterogeneity than the collection. 

Various pitfalls of the composite refinement protocol were recognized and need to be 
addressed in the future. Addressing the problem of overlapping bands will significantly increase 
the reliability of the method given very approximate positional restraints, and make the 
method more useful in low resolution, large uncertainty cases. Perhaps many models can be 
generated for such bands and the best combination of those models can be used. At lower 
resolution, use of coarse-grained sampling (fragment sampling) may also be useful, followed 
by fine-grained φ-ψ-χ sampling. 

From the heterogeneity perspective, a fundamental question to address would be the 
estimation of the nature of the underlying conformations before attempting to model them, because 
ensemble sampling must have prior knowledge of the number and occupancies of its members. 
Generation of collections seems the only way to make such an estimate, for which the collection 
modelling method will have to be modified suitably to sample within electron density yet avoid 
the bias towards higher density. The main challenge of ensemble sampling is the explosion 
in conformational freedom, and the work ahead will have to focus on efficiently scaling this 
sampling method to larger ensemble sizes. 